=======================================================

Scott Burns

Udacity – Data Analyst Nanodegree

Project 3: Data Analysis with R

========================================================

Introduction and Background

Why I chose this dataset

I live in the city of Oakland, California, and plan to stay here. I’m curious about crime trends in the city, as they may have a direct impact on my life and on important decisions I encounter in the future. Thus, as I was exploring dataset options for this project, I was excited to uncover this dataset on the website OpenOakland.org, which has records of all crime reports from the city of Oakland from 2007 through earlier this summer, including incident details such as geographic location and crime type. It looked fascinating to me, and I decided to dive in.

Questions initially considered

  1. How has the incidence of crime been trending in the city over the last 8 years?
  2. Have incidences of particular types of crime been growing / diminishing differentially?
  3. Where is crime occurring geographically?
  4. How has crime incidence changed in particular areas within the city?

Primarily, my focus was on generating a few thought-provoking visualizations, to see if they might suggest directions for more in-depth exploration.

About the dataset

The data were downloaded from data.openoakland.org.

Additional background on the dataset is available on Rik Belew’s blog.

Background on the dataset’s CrimeCat classifications can be found on this explainer page.

More detailed background on the dataset is available in Dataset Description section at the end of this file.


Stream-of-Consciousness Exploration

Getting Started: Preparing the RStudio environment

In preparation for analysis, I loaded in the dataset of interest, and glanced at summary information about it (output suppressed).

I also created two variables to potentially use throughout subsequent plots.

  1. An any_crime dummy variable. For this, any record where Desc and CrimeCat are not blank is assigned a 1 value. If a record doesn’t even have this minimal information about the crime, I think it’s better to exclude it as an incident.

  2. A date_format column, holding a Date-class representation of the Date string.
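A minimal sketch of these two derived columns (the csv filename and the Date string format are assumptions):

```r
# Assumes the data have already been read in, e.g.:
# crimes <- read.csv("oakland_crimes.csv", stringsAsFactors = FALSE)

# 1. any_crime: 1 when both Desc and CrimeCat carry information, else 0
crimes$any_crime <- ifelse(crimes$Desc != "" & crimes$CrimeCat != "", 1, 0)

# 2. date_format: Date-class representation of the Date string
#    (the "%Y-%m-%d" format is an assumption about the raw data)
crimes$date_format <- as.Date(crimes$Date, format = "%Y-%m-%d")
```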

Initial Exploration: Rough plots of crime data

As a first basic attempt to visualize the data, I plotted a histogram of crime reports, using a binwidth of 30 days.
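The 30-day histogram might be produced along these lines (a sketch; labels and aesthetics are assumptions):

```r
library(ggplot2)

# Histogram of incident dates in 30-day bins,
# restricted to records flagged by any_crime
ggplot(subset(crimes, any_crime == 1), aes(x = date_format)) +
  geom_histogram(binwidth = 30) +
  labs(x = "Date", y = "Crime reports per 30-day bin")
```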

From the histogram, we see indications that crime reports have fallen significantly in Oakland over the last 7 years, as report counts per 30-day period in 2007 and 2008 appear to number between 9,000 and 10,000, while counts per period have been around 4,000 in recent years.

I was also curious how crimes were distributed throughout the day over the period in question. To see the dynamics, I plotted another histogram by time of day, with hourly bins, creating an hour variable to more easily bin incidents.
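Deriving the hour variable and binning hourly could look like this (assumes Time strings of the form "14:35:00"):

```r
library(ggplot2)

# Integer hour of day (0-23) parsed from the Time string
crimes$hour <- as.integer(sub(":.*$", "", crimes$Time))

ggplot(subset(crimes, any_crime == 1), aes(x = hour)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = seq(0, 23, 2)) +
  labs(x = "Hour of day", y = "Crime reports")
```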

The hourly distribution surprised me - first by the huge spike at hour 0 (i.e. midnight to 1am). I suspected that this might be related to data entry issues, so I also checked the composition of crimes committed at hour zero, loading them from the original Time variable. From this, I discovered that 98,823 of the 118,590 hour-zero (one_am) records have the value exactly “0:00:00”. This strengthened my suspicion that the spike was driven by coding issues.

Relatively confident that most of the 98,823 records occurring at midnight were probably set in the absence of a known time for a crime, or as a result of time recording errors, I decided to run the histogram on a subset that removed the ‘exactly midnight’ crimes.

## 0:00:00 0:30:00 0:01:00 0:15:00 0:05:00 0:45:00 0:10:00 0:20:00 0:50:00 
##   98823    3030    2517    1215    1000     913     876     766     641 
## 0:40:00 0:25:00 0:35:00 0:55:00 0:02:00 0:08:00 0:06:00 0:48:00 0:23:00 
##     635     461     437     410     207     198     188     186     185 
## 0:03:00 0:17:00 0:16:00 0:13:00 0:22:00 0:19:00 0:31:00 0:39:00 0:29:00 
##     172     170     164     162     162     156     156     156     155 
## 0:04:00 0:44:00 0:43:00 0:42:00 0:18:00 0:14:00 0:53:00 0:24:00 0:52:00 
##     153     153     152     150     149     147     147     146     146 
## 0:07:00 0:37:00 0:32:00 0:38:00 0:12:00 0:33:00 0:34:00 0:21:00 0:11:00 
##     145     145     144     144     143     143     143     141     140 
## 0:58:00 0:51:00 0:27:00 0:28:00 0:36:00 0:41:00 0:59:00 0:54:00 0:47:00 
##     139     134     129     129     127     122     122     121     120 
## 0:09:00 0:46:00 0:56:00 0:49:00 0:57:00 0:26:00 0:01:19 0:03:59 0:04:16 
##     118     114     112     111     110      98       1       1       1 
## 0:08:05 0:08:30 0:10:25 0:12:22 0:19:12 0:33:02 0:33:46 0:35:20 0:42:49 
##       1       1       1       1       1       1       1       1       1 
## 1:00:00 1:00:07 1:01:00 1:01:53 1:02:00 1:03:00 1:04:00 1:04:30 1:05:00 
##       0       0       0       0       0       0       0       0       0 
## 1:06:00 1:06:10 1:07:00 1:08:00 1:09:00 1:10:00 1:11:00 1:12:00 1:13:00 
##       0       0       0       0       0       0       0       0       0 
## 1:14:00 1:15:00 1:16:00 1:17:00 1:18:00 1:19:00 1:19:27 1:20:00 1:21:00 
##       0       0       0       0       0       0       0       0       0 
## (Other) 
##       0
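The check above and the filtered re-plot can be sketched as follows (assumes the hour column created earlier; `one_am` follows the naming used in the text):

```r
library(ggplot2)

# Tabulate the original Time strings for hour-zero records
one_am <- subset(crimes, hour == 0)
sort(table(one_am$Time), decreasing = TRUE)[1:10]

# Histogram again, dropping records stamped exactly midnight
ggplot(subset(crimes, any_crime == 1 & Time != "0:00:00"), aes(x = hour)) +
  geom_histogram(binwidth = 1)
```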

In the modified histogram, I was also somewhat surprised that crimes don’t spike to a greater extent in the evenings, with incident rates staying fairly uniform from 10am to 4pm, then rising to a moderate peak around 7pm and declining from there to a minimum at the 5am hour.

Next, I wanted to explore the distribution of crime by latitude and longitude, with histograms for each variable.

After a first glance at raw plots for each, I realized I needed to remove a few obviously incorrect entries to arrive at the histograms below; summarizing the remaining reasonable values for Lat and Lng also gave me more useful axis ranges for the charts.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   37.35   37.77   37.79   37.79   37.81   38.34

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -123.0  -122.3  -122.2  -122.2  -122.2  -122.1
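The trimming and summaries might look like this (the coordinate bounds are assumptions chosen to exclude obvious geocoding errors):

```r
# Keep coordinates that plausibly fall in or near Oakland
geo <- subset(crimes, Lat > 37.3 & Lat < 38.5 & Lng > -123.5 & Lng < -122.0)

summary(geo$Lat)
summary(geo$Lng)
```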

Crime seems to skew toward the mid-Northern and Western parts of the city, based on the histogram.

For a more detailed view of geographic distribution, I decided to use ggmap to plot a basic heatmap of crime in Oakland for the period in question.
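A ggmap heatmap along the lines described (zoom level, coordinate bounds and colour scale are all assumptions):

```r
library(ggmap)
library(ggplot2)

# Base map centred on Oakland
oak_map <- get_map(location = "Oakland, California", zoom = 12)

# 2D density overlay of incident coordinates
ggmap(oak_map) +
  stat_density2d(data = subset(crimes, any_crime == 1 &
                                 Lat > 37.6 & Lat < 37.9 &
                                 Lng > -122.4 & Lng < -122.1),
                 aes(x = Lng, y = Lat, fill = ..level.., alpha = ..level..),
                 geom = "polygon") +
  scale_fill_gradient(low = "yellow", high = "red") +
  guides(alpha = FALSE)
```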

In the heatmap, we see a strong concentration of crimes in the downtown area, with a pocket also slightly to the northwest along San Pablo Avenue, then all along International Boulevard to the southeast.

For another view of Oakland crime dynamics, I created a line plot of incidents per day, using the dplyr aggregation techniques we learned in the Data Analysis in R course.

Note: creating new dataframe based on incidents per day, using dplyr ‘verbose’ method, then using this in line plot.
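The ‘verbose’ dplyr steps might look like this (dataframe and column names are assumptions):

```r
library(dplyr)
library(ggplot2)

# 'Verbose' style: each dplyr step assigned explicitly
by_date <- group_by(crimes, date_format)
crimes_by_date <- summarise(by_date, incidents = sum(any_crime))
crimes_by_date <- arrange(crimes_by_date, date_format)

ggplot(crimes_by_date, aes(x = date_format, y = incidents)) +
  geom_line()
```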

In the daily line plot, the decline of crime reports over time is clearly visible.

As daily crime incident counts are fairly noisy, I wanted to take a look at potentially smoother time increments - creating a line plot of any_crime incidents each month.

To create the monthly plot, I cut our date_format data in monthly units, using the syntax we learned in lesson 5 of the Data Analysis in R course, then aggregated crime data by month. To do so, I applied dplyr methods again, but this time using the ‘concise’ syntax.
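The monthly cut and the ‘concise’ (chained) dplyr aggregation might be:

```r
library(dplyr)

# Cut dates into calendar-month bins
crimes$month <- as.Date(cut(crimes$date_format, breaks = "month"))

# 'Concise' chained style
crimes_by_month <- crimes %>%
  group_by(month) %>%
  summarise(incidents = sum(any_crime)) %>%
  arrange(month)
```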

The downward trend in crime reports is again evident in plotted monthly crime report data. Another striking feature of the chart is the large vertical drop in incident count at the beginning of 2014.

It seems very strange to me that the reported number of crimes would be fairly constant throughout 2012 and 2013, fall by around 40% as the year changed, then persist at a largely constant lower level around 3,700 incidents per month for the next year. I wonder if there was a major change in the way crimes were recorded or reported starting at the beginning of 2014.

To better understand the distribution of daily crime rates, I also plotted mean, median and quartile measures of daily crime rates.

Also - to note - I was seeing several report data points for dates in the future. I assume these were the result of data entry errors. For all plots going forward, I have chosen to subset by incidents occurring prior to a ‘cut-off’ date of June 2015.

Mean and median daily crime rates:

Initial Exploration: Rough plots with incidents segmented by crime type

After having uncovered some interesting insights about local crime trends at an aggregate level, I wanted to dive down into crime segments, using some of the multi-variate visualization techniques we covered in the later lessons of Data Analysis in R.

To understand how I could best group crime report incidents by description, I surveyed all the potential values appearing in the columns CrimeCat and Desc.

I assessed unique(crimes$Desc) but suppressed the output for this column because there are about 1800 unique descriptions in the fields. This would not be a useful column to use in grouping for plots.

On the other hand, the CrimeCat column includes about 60 unique categories, as verified by unique(crimes$CrimeCat). This number is too large for tractable visualizations, but I decided I could group these into a smaller number of main categories (variable: mainCat) using grepl string matching.

Thus I grouped crimes into homicide, robbery/larceny, assault, rape, weapons, domestic violence, traffic and court violations, and ‘quality of life’ (a class of crimes noted in the dataset covering relatively minor violations, including ‘curfew-loitering’, drug possession and incidents related to public liquor consumption).
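The grepl-based grouping can be sketched like this (the patterns and labels are assumptions about how CrimeCat values are written, and only a few branches are shown):

```r
# Collapse ~60 CrimeCat values into a handful of main categories
crimes$mainCat <-
  ifelse(grepl("homicide", crimes$CrimeCat, ignore.case = TRUE), "Homicide",
  ifelse(grepl("robbery|larceny", crimes$CrimeCat, ignore.case = TRUE), "Robbery / Larceny",
  ifelse(grepl("assault", crimes$CrimeCat, ignore.case = TRUE), "Assault",
  ifelse(grepl("quality", crimes$CrimeCat, ignore.case = TRUE), "Quality of Life",
         "Other"))))
```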

I then surveyed examples from the output with head(crimes,50) and tail(crimes,50) to ensure the new column assignment worked correctly.

As in previous examples, I created a new dataframe with dplyr functions, this time grouping by both date and the new mainCat variable.

The results are fairly messy with daily measures, so I decided to create a similar line plot with mainCat groupings but monthly crime incidents on the x-axis. For this I built a new crime_types_by_month dataframe, as seen below.
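The crime_types_by_month dataframe and the single-plot view might be built as follows (assumes the monthly cut and mainCat columns described above):

```r
library(dplyr)
library(ggplot2)

crime_types_by_month <- crimes %>%
  group_by(month, mainCat) %>%
  summarise(incidents = sum(any_crime)) %>%
  ungroup()

ggplot(crime_types_by_month, aes(x = month, y = incidents, color = mainCat)) +
  geom_line()
```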

Interesting trends are visible when grouping incidents by type in a single plot, but the output is still fairly messy and dynamics for some categories are hard to discern, as the scales of total incidents in each crime category are substantially different.

Thus, I decided to re-plot crime_types_by_month in a facet wrap with ‘free_y’ scale to better view dynamics by crime type.

I found this chart to be striking - as incidents for all the main crime categories appear to have dropped significantly, while each shows a different pattern of decline. Some categories, like Assault and Robbery, plummeted around 2010 then remained steady, while others - like Homicide, Traffic, Domestic Violence and Other - showed big drops later, around 2013 and 2014.

The strange discontinuity at the 2014 year mark I noted earlier is also present in these (the later declining) categories, with incident counts holding steady in 2014 and 2015 after falling massively from much higher levels immediately before 2013 year-end.

Initial Exploration: Rough plots with incidents segmented by location

Besides tracking crime dynamics by type, I also wanted to explore how crimes were distributed geographically by Police Beats. To do so, I used the Beat variable, which indicates where the crime occurred/was recorded. More detail on the ‘Beat’ variable is available in the Dataset description below and at this link. There are 135 distinct police beats included as Beat values, with some values strictly numeric (like ‘31’) and others having alphanumeric identifiers (like ‘26X’).

Links to beat maps, as provided by the Oakland Police Department, are available (here and here).

Similar to the previous exploration, I built a new dataframe, grouping on incidents per month and Beat.

First I plotted month and beat dynamics on one chart, associating each beat with a color. As seen, the result was far too crowded to be useful. I thought perhaps a stacked bar chart of incidents by beat might show more insight - also plotted below. Clear insight wasn’t visible in the column plots either.

Going forward I decided that there are too many police beats to effectively display in a chart, and wanted to take the top 20 beats by total crime reports, then group incidents in other beats all under ‘Other’. As you’ll see in calculations below - these top 20 beats cover about 52% of all crime reports with descriptions (our any_crime variable).

First I created a dataframe with total crime incidents regardless of date.

Then I added another column to the crimes dataframe - mainBeat - preserving the value for the top 20 beats by incidents, and labeling all other beats ‘Other’.
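The top-20 calculation and the mainBeat column might look like this (names are assumptions):

```r
library(dplyr)

# Total incidents per beat, regardless of date
beat_totals <- crimes %>%
  group_by(Beat) %>%
  summarise(incidents = sum(any_crime)) %>%
  arrange(desc(incidents))

top_beats <- head(beat_totals$Beat, 20)

# Share of all described incidents covered by the top 20 beats (~52%)
sum(head(beat_totals$incidents, 20)) / sum(beat_totals$incidents)

# mainBeat: keep the top-20 labels, lump everything else under 'Other'
crimes$mainBeat <- ifelse(crimes$Beat %in% top_beats,
                          as.character(crimes$Beat), "Other")
```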

Following the transformation, I revised the crime_by_month_and_beat dataframe with the mainBeat variable to produce clearer output, re-plotting with facet wrap to more clearly see the crime incident dynamic by beat.

## [1] 0.5221023

##      month               mainBeat      incidents            n         
##  Min.   :2007-01-01   04X    : 101   Min.   :  41.0   Min.   :  42.0  
##  1st Qu.:2009-02-01   06X    : 101   1st Qu.: 123.0   1st Qu.: 130.0  
##  Median :2011-03-01   07X    : 101   Median : 164.0   Median : 175.0  
##  Mean   :2011-03-02   08X    : 101   Mean   : 309.7   Mean   : 326.4  
##  3rd Qu.:2013-04-01   19X    : 101   3rd Qu.: 223.0   3rd Qu.: 232.0  
##  Max.   :2015-05-01   20X    : 101   Max.   :4623.0   Max.   :4855.0  
##                       (Other):1515

Again, in the per-Beat facet view we see uniformly down-trending crime incidence over time, with some beats experiencing more pronounced local spikes from mid- to late 2013. For some beats, including 04X, 08X, 20X and others, we also see the strange discontinuity at the end of 2013 that appeared in other views of monthly crime over time.

For these police beats, crime stays steady or spikes toward the end of 2013, then drops precipitously right at the new year, and holds steady or declines from that lower level to the present day.

Mapping trends to geography, we can see where each of the top 20 police beats are located in our map below.

Oakland Police Beat Map Courtesy of the Oakland Police Department

Note that many of the top beats by crime incidence correspond to the ‘hotter’ regions in downtown Oakland as shown in our earlier exploratory heatmap, particularly ‘08X’, ‘04X’ and ‘06X’.

As a reminder, we can see a modified heatmap below, plotted over a Google roadmap.

I chose to loop back on the question of geographic distribution over time with a comparison of crime heatmaps for periods at the beginning and end of our dataset. Two heatmaps are arranged with a similar density scale to see what the distribution of crime looked like in Oakland during 2008 vs 2014. Consistent with earlier analysis, by 2014 the map is ‘cooler’ overall, with bright hotspots more subdued.

In focusing primarily on the highest-crime beats in the last few graphs, changes in the distribution of crime across beats were less obvious, so I decided to create a plot to fill that gap. The following is a plot of central-tendency and quantile measures for monthly crime incidents by police beat. I thought the results might also offer suggestions for exploring the logic and purpose of police beat demarcation.

Monthly crime reports by Police Beat - mean, median, top decile and bottom decile of incidents

A few things were striking to me about these plots:

  • Mean and median crime rates by beat are tightly aligned, suggesting that the distribution of crime incidence across beats is not heavily skewed by a few extreme beats.

  • The steep decline in crime seen in the top-20 police beats by crime rate is not so evident for the median beat or for beats in the lower-quantile segments - in fact, the crime threshold for the bottom decile of police beats actually increased from 2008 levels. Thus most of the aggregate reduction in crime over the last few years seems to have come from declining crime rates in the highest-crime police beats.

  • The distribution of crime rates across police beats has narrowed significantly as a result of the above trends - the ratio of monthly crimes in the top-decile police beats vs the bottom decile was about 275 / 5 (roughly 55:1) before 2008, but around 120 / 10 (roughly 12:1) by 2014.

Also - the spike in incidents for bottom-decile police beats by crime rate in 2010 is somewhat strange. I suspect it might be an artifact of new police beats being created at that time.


Final Plots and Summary

For my final overview plots I chose to sharpen and adjust a few views we looked at in the Initial Exploration section, and to refine the associated fitted curves with linear models for each plot, instead of using the standard non-parametric geom_smooth function.

Final Plot 1: Aggregate view of weekly crime incidence in dataset

For the aggregate crime trends chart, I decided to use weekly crime incident data on my x-axis, as this could provide a balance between the bias and variance poles of the daily and monthly periods from earlier plots. I also plotted the log of y, and fit a linear model, aiming to run a regression that yields usefully interpretable coefficients.

First, weekly analysis requires weekly units, and I cut date_format accordingly.
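Weekly bins and aggregation, following the same pattern as the monthly cut (names are assumptions):

```r
library(dplyr)

# Cut dates into calendar-week bins, then aggregate
crimes$week <- as.Date(cut(crimes$date_format, breaks = "week"))

crimes_by_week <- crimes %>%
  group_by(week) %>%
  summarise(incidents = sum(any_crime))
```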

For my final aggregate chart, I also wanted to set up a simple model for assessing crime trends. Looking at our trend line, I note that a linear model provides a very strong fit, with an F-test p-value of close to zero and an R-squared of around 0.82.

Based on the estimated coefficient, it looks like crime has been trending down at a rate of about 11% per year on average over the dataset period.
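The model and the annualized reading of its slope might look like this (assumes a crimes_by_week dataframe with a Date-class week column and an incidents count, aggregated as described above):

```r
# Linear model of log weekly incidents on time
fit <- lm(log(incidents) ~ week, data = crimes_by_week)
summary(fit)  # per the text: F-test p-value near zero, R-squared ~0.82

# Because Date predictors are measured in days, the 'week' coefficient
# is the change in log incidents per day; annualize and exponentiate:
annual_pct_change <- (exp(coef(fit)[["week"]] * 365.25) - 1) * 100
```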

Final Plot 2: View of monthly crime, faceted by crime type

In my second chart, I wanted to refine the faceted-by-crime-type charts I produced in the exploratory analysis, but use corresponding regression lines based on linear models for each facet as in Final Plot 1. For the facets, I decided to draw on monthly crime statistics, as weekly figures in certain categories were very sparse with too much variance, and I left the y-scale free.

Note that the discontinuities in crime reduction around the beginning of 2014 are even more striking when plotting the log of incidents on the y axis.

To test how the inclusion of crime type variables might shape the fit of our trend line, I decided to also include mainCat as an independent variable in a linear regression model. The addition seems to improve fit slightly versus a model of incidents vs time alone. R-squared for the new model is 0.85.
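Adding the crime-type factor to the regression might look like this (assumes the crime_types_by_month dataframe described earlier; the +1 inside the log guards against sparse zero-count months and is an assumption):

```r
# Time plus crime type as predictors
fit2 <- lm(log(incidents + 1) ~ month + mainCat,
           data = crime_types_by_month)
summary(fit2)  # per the text: R-squared ~0.85
```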

Reflection

My biggest struggles with the dataset related to the process of grouping data into dataframes with incidents over useful time periods, transforming data to proper formats, segmenting by tractable crime and police beat groups and properly formatting my final plots. Going through the steps required for the output here, I learned much about string matching, dataframe reshaping and subsetting in R, as well as heatmap plotting with ggmap and themes for ggplot.

Once the data were properly grouped and I could run plots, I found the output to be fascinating. Coming across the dataset was a huge success for me, as it is a rich source of insight on a topic in which I have a strong interest. Approaches to plotting learned in the Data Analysis in R course proved useful for visualizing elements of the dataset to reveal the evidence highlighted above.

The exploratory analysis here prompts me to explore a few more questions on the crime data. A few that are top-of-mind I outline below:

  1. What caused the massive, sudden drop at the beginning of 2014 in reported crimes for many crime categories? The discontinuity at the change in the year is so striking, it seems that it must be due to a change in policing or reporting policy, rather than a strange and dramatic drop-off in crimes committed.
  2. How are crime types and police beats correlated? Are there specific regions associated with particular crimes?
  3. What does the heatmap look like for specific crime types (e.g. violent crime or robbery)?
  4. (with additional data) How have shifts in crime rates corresponded to changes in economic outcomes, property values and educational indicators in Oakland?

In my explorations, I also came across many references to packages that might be useful for time-series analysis in the future, such as ‘zoo’. I’d like to try these out on other datasets.

Also, I see the need to become more familiar with using functions when plotting in R, so that I can better re-apply logic and plot formatting, and continue to eliminate code repetition.


References and Sources

Melt: http://www.r-bloggers.com/melt/

Reshape background: http://seananderson.ca/2013/10/19/reshape.html

Melting for time series: http://stackoverflow.com/questions/1181060/reshaping-time-series-data-from-wide-to-tall-format-for-plotting

Reshaping data: http://www.r-bloggers.com/reshape-and-aggregate-data-with-the-r-package-reshape2/

Creating columns with if-else statements: http://stackoverflow.com/questions/13672781/populate-a-column-using-if-statements-in-r

String matching with grepl: http://www.endmemo.com/program/R/grepl.php

Choosing between regression models: http://stats.stackexchange.com/questions/43930/choosing-between-lm-and-glm-for-a-log-transformed-response-variable

Creating graph titles with ggplot: http://www.cookbook-r.com/Graphs/Titles_(ggplot2)/ http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/

Overlaying fitted regressions in ggplot: http://stackoverflow.com/questions/1476185/how-to-overlay-a-line-for-an-lm-object-on-a-ggplot2-scatterplot http://stackoverflow.com/questions/10528631/add-exp-power-trend-line-to-a-ggplot

Working with ggmap: https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/ggmap/ggmapCheatsheet.pdf https://cran.r-project.org/web/packages/ggmap/ggmap.pdf http://www.geo.ut.ee/aasa/LOOM02331/heatmap_in_R.html

Formatting plots: http://www.cookbook-r.com/Graphs/Legends_(ggplot2)/ http://stackoverflow.com/questions/10035786/ggplot2-y-axis-label-decimal-precision http://docs.ggplot2.org/current/theme.html http://docs.ggplot2.org/dev/vignettes/themes.html


Dataset Description

Crime reports from Oakland: January 2007 to July 2015

Description

A dataset containing descriptions, event timing, and geographic information from over 690,000 crime reports in Oakland during the period 2007 to mid-2015.

Usage

Read in the dataset from csv

Format

A data frame with 696372 rows and 21 variables

Details

See more details on the dataset in a wiki page built by Rik Belew, at this link, with an associated Data Dictionary here.

  • Idx. arbitrary unique identifier (0 - 696,372)
  • OPD_RD. the crime identifier originally assigned by OPD
  • OIdx. This field ranges from zero up to the number of individual USC (see below) records associated with a particular incident. While OPD only reveals a single record for each crime incident, multiple crime records are maintained internally. If one or more USC records are available, they are indexed with numbers 1, 2, 3, etc.; when there are none, OIdx = 0.
  • Date. This is the date associated with the crime by OPD
  • Time. This is the time associated with the crime by OPD
  • CType. This is the text string crime description provided by OPD
  • Desc. This is a more detailed text string crime description provided by OPD
  • Beat. The police beat associated with the crime by OPD (135 unique records - see below for beat map)
  • Addr. Geographic location of incident - see geographic note below
  • Lat. Geographic location of incident - see geographic note below
  • Lng. Geographic location of incident - see geographic note below
  • Src. Source of data
  • UCR. Indicator used by USC* in compiling crime data
  • Statute. Legal statute violation associated with crime

Geographic note: The original source of these variables was the OPD record. However, because addresses act as the critical link to geocoding (latitude/longitude coordinates), special procedures were used to normalize and “cache” address strings used more than once. This minimizes the number of required geocoding queries. A by-product of this process is that more complete, normalized addresses are generated - in particular, they generally include a zip code. Note this allows “extrapolation” from crimes whose addresses were geocoded, to provide geocodes for other crimes sharing the same address.

*USC: Urban Strategies Council - organization contributing to the dataset

Courtesy of the Oakland Police Department